AITopics | value estimator

Recent advances in large language models (LLMs) have increasingly relied on reinforcement learning (RL) to improve their reasoning capabilities. Three types of approaches have been widely adopted. The first relies on a deep neural network to estimate the value function of the learning policy in order to reduce the variance of the policy gradient. However, estimating and maintaining such a value network incurs substantial computational and memory overhead. The second avoids training a value network by approximating the value function using sample averages. However, it must sample a large number of reasoning traces per prompt to approximate the value function accurately, making it computationally expensive. The third samples only a single reasoning trajectory per prompt, which reduces computational cost but suffers from poor sample efficiency. This paper focuses on a practical, resource-constrained setting in which only a small number of reasoning traces can be sampled per prompt, while low-variance gradient estimation remains essential for high-quality policy learning. To address this challenge, we bring classical nonparametric statistical methods, which are both computationally and statistically efficient, to LLM reasoning. We employ kernel smoothing as a concrete example for value function estimation and the subsequent policy optimization. Numerical and theoretical results demonstrate that our proposal achieves accurate value and gradient estimation, leading to improved policy optimization.
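
The kernel-smoothing idea in the abstract above can be illustrated with a Nadaraya-Watson estimator. This is a minimal sketch under stated assumptions, not the paper's estimator: it assumes each prompt comes with a fixed-size embedding (hypothetical), uses a Gaussian kernel with a hand-picked bandwidth, and borrows reward information from nearby prompts so that only a few traces per prompt are needed. The function names are illustrative.

```python
import numpy as np


def kernel_value_estimates(embeddings, rewards, bandwidth=1.0):
    """Nadaraya-Watson estimate of the value of each prompt.

    embeddings: (n, d) array, one (hypothetical) embedding per prompt.
    rewards:    (n, k) array, rewards of k sampled traces per prompt.
    bandwidth:  Gaussian kernel bandwidth (an assumed tuning parameter).
    """
    embeddings = np.asarray(embeddings, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    # Squared Euclidean distance between every pair of prompts.
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    # Gaussian kernel weights, normalized so each row sums to one.
    weights = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    weights /= weights.sum(axis=1, keepdims=True)
    # Smooth the per-prompt sample means across neighboring prompts.
    return weights @ rewards.mean(axis=1)


def advantages(embeddings, rewards, bandwidth=1.0):
    """Reward minus the smoothed value, usable as a policy-gradient weight."""
    values = kernel_value_estimates(embeddings, rewards, bandwidth)
    return np.asarray(rewards, dtype=float) - values[:, None]
```

With a very small bandwidth the estimate reduces to the per-prompt sample mean, and with a very large bandwidth it approaches a single global-mean baseline; the bandwidth trades bias against variance between those two extremes.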

large language model, machine learning, reinforcement learning, (19 more...)

arXiv.org Machine Learning

2604.28005

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.89)

0c215f194276000be6a6df6528067151-Supplemental.pdf

Neural Information Processing Systems, Apr-24-2026, 15:49:55 GMT

artificial intelligence, bottleneck, machine learning, (17 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.69)

0c215f194276000be6a6df6528067151-Paper.pdf

Neural Information Processing Systems, Apr-24-2026, 15:49:52 GMT

machine learning, natural language, reinforcement learning, (16 more...)

Neural Information Processing Systems

Country:

North America > United States (0.28)
North America > Canada > Quebec (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

816b112c6105b3ebd537828a39af4818-Paper.pdf

Neural Information Processing Systems, Feb-9-2026, 13:57:26 GMT

estimator, kernel-based method, value estimator, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > North Carolina (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)

Genre: Research Report (0.68)

Industry: Health & Medicine > Therapeutic Area (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Data Science (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)

0c215f194276000be6a6df6528067151-Supplemental.pdf

Neural Information Processing Systems, Feb-7-2026, 11:22:23 GMT

agent, bottleneck, ood 0, (16 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.69)

Generative Large-Scale Pre-trained Models for Automated Ad Bidding Optimization

Lei, Yu, Zhao, Jiayang, Zhao, Yilei, Zhang, Zhaoqi, Cai, Linyou, Xie, Qianlong, Wang, Xingxing

arXiv.org Artificial Intelligence, Dec-9-2025

Modern auto-bidding systems are required to balance overall performance with diverse advertiser goals and real-world constraints, reflecting the dynamic and evolving needs of the industry. Recent advances in conditional generative models, such as transformers and diffusers, have enabled direct trajectory generation tailored to advertiser preferences, offering a promising alternative to traditional Markov Decision Process-based methods. However, these generative methods face significant challenges, such as the distribution shift between offline and online environments, limited exploration of the action space, and the necessity to meet constraints like marginal Cost-per-Mille (CPM) and Return on Investment (ROI). To tackle these challenges, we propose GRAD (Generative Reward-driven Ad-bidding with Mixture-of-Experts), a scalable foundation model for auto-bidding that combines an Action-Mixture-of-Experts module for diverse bidding action exploration with the Value Estimator of Causal Transformer for constraint-aware optimization. Extensive offline and online experiments demonstrate that GRAD significantly enhances platform revenue, highlighting its effectiveness in addressing the evolving and diverse requirements of modern advertisers. Furthermore, GRAD has been implemented in multiple marketing scenarios at Meituan, one of the world's largest online food delivery platforms, leading to a 2.18% increase in Gross Merchandise Value (GMV) and 10.68% increase in ROI.

constraint, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2508.02002

Country: Asia > China (0.15)

Genre:

Research Report (0.64)
Instructional Material (0.46)

Industry:

Marketing (0.94)
Information Technology > Services (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.72)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)

816b112c6105b3ebd537828a39af4818-Paper.pdf

Neural Information Processing Systems, Aug-15-2025, 12:18:46 GMT

estimator, kernel-based method, value estimator, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > North Carolina (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)

Genre: Research Report (0.46)

Industry: Health & Medicine > Therapeutic Area (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Data Science (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)

A Look at Value-Based Decision-Time vs. Background Planning Methods Across Different Settings

Alver, Safa, Precup, Doina

arXiv.org Artificial Intelligence, Aug-12-2024

In model-based reinforcement learning (RL), an agent can leverage a learned model to improve its behavior in several ways. Two prevalent ones are decision-time planning and background planning. In this study, we examine how the value-based versions of these two planning methods compare across different settings. To this end, we first consider the simplest instantiations of value-based decision-time and background planning methods and provide theoretical results on which one will perform better in the regular RL and transfer learning settings. Then, we consider their modern instantiations and provide hypotheses on which one will perform better in the same settings. Finally, we perform illustrative experiments to validate these theoretical results and hypotheses. Overall, our findings suggest that even though value-based versions of the two planning methods perform on par in their simplest instantiations, the modern instantiations of value-based decision-time planning methods can perform on par with or better than the modern instantiations of value-based background planning methods in both the regular RL and transfer learning settings.
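
The distinction between the two planning styles can be made concrete on a toy problem. The sketch below is an illustration under assumptions, not the instantiations studied in the paper: it uses a hypothetical 4-state deterministic chain whose model is known exactly, implements background planning as Dyna-style sweeps that update a Q-table from model-generated transitions, and implements decision-time planning as an exhaustive depth-limited lookahead from the current state.

```python
import numpy as np

# Hypothetical 4-state chain; action 0 moves left, action 1 moves right.
N_STATES, N_ACTIONS, GAMMA = 4, 2, 0.9


def model(state, action):
    """Toy model (an assumption): reward 1 on entering the absorbing last state."""
    if state == N_STATES - 1:
        return state, 0.0  # absorbing state, no further reward
    nxt = state + 1 if action == 1 else max(state - 1, 0)
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)


def background_planning(n_sweeps=50):
    """Improve a Q-table ahead of time using transitions simulated by the model."""
    q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(n_sweeps):
        for s in range(N_STATES):
            for a in range(N_ACTIONS):
                nxt, r = model(s, a)
                q[s, a] = r + GAMMA * q[nxt].max()
    return q


def decision_time_planning(state, depth=3):
    """Pick an action by depth-limited lookahead from the current state only."""
    def best_return(s, d):
        if d == 0:
            return 0.0
        outcomes = (model(s, a) for a in range(N_ACTIONS))
        return max(r + GAMMA * best_return(n, d - 1) for n, r in outcomes)

    returns = []
    for a in range(N_ACTIONS):
        nxt, r = model(state, a)
        returns.append(r + GAMMA * best_return(nxt, depth - 1))
    return int(np.argmax(returns))
```

Background planning pays its cost up front and then acts greedily from the stored table, whereas decision-time planning stores nothing and spends computation at each decision; its answer depends on whether the lookahead depth reaches the reward.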

algorithm, deep dyna-q algorithm, dyna-q algorithm, (15 more...)

arXiv.org Artificial Intelligence

2206.08442

Country:

North America > Canada > Quebec > Montreal (0.28)
North America > Canada > Alberta (0.04)
North America > Barbados (0.04)
Asia > Japan > Honshū > Chūbu > Toyama Prefecture > Toyama (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Leisure & Entertainment > Games (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Value Augmented Sampling for Language Model Alignment and Personalization

Han, Seungwook, Shenfeld, Idan, Srivastava, Akash, Kim, Yoon, Agrawal, Pulkit

arXiv.org Artificial Intelligence, May-10-2024

Aligning Large Language Models (LLMs) to cater to different human preferences, learn new skills, and unlearn harmful behavior is an important problem. Search-based methods, such as Best-of-N or Monte-Carlo Tree Search, are performant, but impractical for LLM adaptation due to their high inference cost. On the other hand, using Reinforcement Learning (RL) for adaptation is computationally efficient, but performs worse due to the optimization challenges in co-training the value function and the policy. We present a new framework for reward optimization, Value Augmented Sampling (VAS), that can maximize different reward functions using data sampled from only the initial, frozen LLM. VAS solves for the optimal reward-maximizing policy without co-training the policy and the value function, which makes the optimization stable. It outperforms established baselines, such as PPO and DPO, on standard benchmarks and achieves results comparable to Best-of-128 at lower inference cost. Unlike existing RL methods that require changing the weights of the LLM, VAS does not require access to the weights of the pre-trained LLM. Thus, it can even adapt LLMs (e.g., ChatGPT) that are available only as APIs. In addition, our algorithm unlocks the new capability of composing several rewards and controlling the extent of each one during deployment time, paving the road ahead for the future of aligned, personalized LLMs.
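
One common way to steer a frozen model with a separately trained value estimator is to add scaled value estimates to the base model's next-token logits before sampling. The sketch below assumes that form; the abstract does not state the exact decoding rule, so the formula, the function name `value_augmented_probs`, and the `betas` weights are assumptions for illustration. Passing several value rows with separate weights mimics the idea of composing rewards and controlling each one at deployment time.

```python
import numpy as np


def value_augmented_probs(base_logits, token_values, betas):
    """Next-token distribution proportional to base_prob * exp(sum_i beta_i * value_i).

    base_logits:  (vocab,) logits from the frozen base model.
    token_values: (vocab,) or (n_rewards, vocab) estimated value of each candidate token.
    betas:        scalar or (n_rewards,) weights, one per reward (assumed knobs).
    """
    base_logits = np.asarray(base_logits, dtype=float)
    token_values = np.atleast_2d(np.asarray(token_values, dtype=float))
    betas = np.atleast_1d(np.asarray(betas, dtype=float))
    # Weighted sum of value estimates across rewards, added in logit space.
    adjusted = base_logits + betas @ token_values
    # Numerically stable softmax.
    adjusted -= adjusted.max()
    probs = np.exp(adjusted)
    return probs / probs.sum()
```

Setting every weight to zero recovers the base model's distribution, and the base model's weights are never touched, only its output logits.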

estimator, language model alignment, value augmented sampling, (11 more...)

arXiv.org Artificial Intelligence

2405.06639

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > Switzerland (0.04)
Asia > Singapore (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Industry:

Health & Medicine (0.67)
Government (0.67)
Automobiles & Trucks (0.67)
Education (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

How Can LLM Guide RL? A Value-Based Approach

Zhang, Shenao, Zheng, Sirui, Ke, Shuqi, Liu, Zhihan, Jin, Wanxin, Yuan, Jianbo, Yang, Yingxiang, Yang, Hongxia, Wang, Zhaoran

arXiv.org Artificial Intelligence, Feb-25-2024

Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback. However, RL algorithms may require extensive trial-and-error interactions to collect useful feedback for improvement. On the other hand, recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities for planning tasks, lacking the ability to autonomously refine their responses based on feedback. Therefore, in this paper, we study how the policy prior provided by the LLM can enhance the sample efficiency of RL algorithms. Specifically, we develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning. The reduction is largest when the difference between the ideal policy and the LLM-informed policy is small, which suggests that the initial policy is close to optimal and reduces the need for further exploration. Additionally, we present a practical algorithm, SLINVIT, that simplifies the construction of the value function and employs subgoals to reduce the search complexity. Our experiments across three interactive environments (ALFWorld, InterCode, and BlocksWorld) demonstrate that our method achieves state-of-the-art success rates and also surpasses previous RL and LLM approaches in terms of sample efficiency. Our code is available at https://github.com/agentification/Language-Integrated-VI.
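
Using a policy prior as a regularizer in value-based RL is often formalized as a KL penalty toward the prior, which has a closed-form solution at each state. The sketch below shows that generic KL-regularized step; it is an assumed textbook form, not the LINVIT algorithm itself, and the names `kl_regularized_policy` and `lam` are illustrative. The policy maximizes expected Q minus `lam` times the KL divergence from the LLM prior.

```python
import numpy as np


def kl_regularized_policy(q_values, prior_probs, lam):
    """Closed-form maximizer of E_pi[Q] - lam * KL(pi || prior) at one state.

    q_values:    (n_actions,) action-value estimates.
    prior_probs: (n_actions,) action probabilities suggested by the LLM prior.
    lam:         regularization strength (> 0); larger means stay closer to the prior.
    Returns the regularized policy and the corresponding soft state value.
    """
    q = np.asarray(q_values, dtype=float)
    prior = np.asarray(prior_probs, dtype=float)
    scaled = q / lam
    shift = scaled.max()  # subtract the max for numerical stability
    unnorm = prior * np.exp(scaled - shift)
    z = unnorm.sum()
    policy = unnorm / z
    # Soft value: lam * log sum_a prior(a) * exp(Q(a) / lam).
    soft_value = lam * (np.log(z) + shift)
    return policy, soft_value
```

When the prior already favors high-value actions, the regularized policy moves little and the soft value stays close to the best Q-value, which matches the intuition that a good prior reduces the need for exploration.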

arxiv preprint arxiv, language model, llm, (14 more...)

arXiv.org Artificial Intelligence

2402.16181

Country: